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Abstract 

In undirected graphical models, learning the graph structure and learning the functions 
that relate the predictive variables (features) to the responses given the structure are two 
topics that have been widely investigated in machine learning and statistics. Learning 
graphical models in two stages will have problems because graph structure may change 
. after considering the features. The main contribution of this paper is the proposed method 

that learns the graph structure and functions on the graph at the same time. General 
graphical models with binary outcomes conditioned on predictive variables are proved to 
be equivalent to multivariate Bernoulli model. The reparameterization of the potential 
\Q . functions in graphical model by conditional log odds ratios in multivariate Bernoulli model 

' offers advantage in the representation of the conditional independence structure in the 

model. Additionally, we impose a structure penalty on groups of conditional log odds 
ratios to learn the graph structure. These groups of functions are designed with overlaps 
to enforce hierarchical function selection. In this way, we are able to shrink higher order 
interactions to obtain a sparse graph structure. Simulation studies show that the method 
is able to recover the graph structure. The analysis of county data from Census Bureau 
gives interesting relations between unemployment rate, crime and others discovered by the 
model. 
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C3 . 1. Introduction 



In undirected graphical models (Markov Random Fields), a graph G is denned as G = 
(V, E), where V = {1, • • • , K} is the set of nodes and E C V x V is the set of links between 
the nodes. In fact, V is associated to a set of multivariate response variables Y\, ■ ■ ■ ,Yk, 
and E specifies the conditional independence structure among them. For example, a link 
between two nodes i,j indicates a pairwise interaction f lJ . , and a clique between three 
nodes k indicates a third order interaction . These functions formulate the effects 
of predicative variables (features), X, on the responses and their interactions. 

Gra phical models have been used in many applications. Conditi onal Random Field s 
(CRF) ( Lafferty et ah . 200ll ) and their extensions, e.g. dynamic CRF ( Sutton et al. . 2007 ). 



are well known in Natural Language Processing community. The CRFs achieve great success 
in sequentially structured text, by modeling the interaction of labels (Y) on the nodes 
conditioned flexibly on the features {X). There are also numerous applications of graphical 
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models to computer vision (jSzeliski et al.l . 120071 ; iHonorio and Samaras! . Ising mode l 



is an other classical example that draws great interest in ima ge processing (IWilliams et al 



2004 ) and also has been recently applied to social networks ([Banerjee et all 12008). 



The intuition of utilizing a graph structure is that some responses are related while 
others are not. However, in many cases, the graph is pre-determined by domain knowledge. 
For example, Duan et al. (2008) proposed a collective model for labeling music signals 
with fully connected graph, which they called collective conditional random fields. They 
have 10 semantic categories such as genre (blues, rap, . . . ), instrument (guitar, piano, . . .), 
production (studio, live), rhythm(strong, weak, middle), and etc. It is possible that some 
links are not necessary: e.g. production and instrument. Estimating the parameters on 
these relations will lead to over-fitting. Therefore, graph structure learning is an important 
aspect of relation discovery in multivariate response applications and multi-task learning. 

Many p apers have focused on the g raphical model selection issue. Meinshausen and 



Buhlmann (200 6j) and Peng et al. 



2009) studied sparse covariance estimation of Gaussian 



outcomes (ISpeed and Kiiveril . 1 19861 ) without input features. The covariance matrix deter- 
mines the dependence structure in the Gaussian distribution and its sparsity specifies the 
linkage in Gaussian Markov Random Fields. This is not the case for non-elliptical distribu- 
tion, such as the distribution of discrete random variables. Ravikumar et al. ( 2010l ) focused 
on graph structure selection of Ising model based on l\ -regularized logistic regression. It 
gave sufficient conditions for consistently estimating the neighborhood of the nodes, with- 
out input features. However, two marginally independent response variables may become 
dependent after conditioning on X. So, ignoring the predicative variables may lead to incon- 
sistent estimation of the graph structure. To the best of our knowledge, there is no previous 
work addressed the issue of learning the graph structure and the functions associated with 
the graph at the same time. 

In this paper, our first contribution is the proof of the equivalence between the general 
graphical model with bivariate outcomes and multivariate Bernoulli (MVB) model. The 
functions that represent the effects of predicative variables on responses and their interac- 
tions (at all levels) can be formulated in MVB model, which is endowed with the advantage 
of interpreting the graph structure. It follows from the sparsity of links in the graphical 
models that some functions are constant zero, which means certain responses are condition- 
ally independent. Therefore, we impose the structure penalty on groups of functions with 
overlaps to obtain sparse estimation of the graph structure. These groups are designed to 
enforce the sparsity on the functions and shrink higher order interactions so that they only 
appear after their lower components have entered the model. 

The paper is organized as follows: Section 2 introduces graphical models and their rela- 
tion with multivariate Bernoulli model. Section 3 and 4 discusses the model for learning the 
graph structure and the functions on the graph through structure penalty. The experiments 
are shown and discussed in Section 5. Section 6 gives the conclusion and future work. 



2. Conditional Independence in Graphical Models 

The notations in this paper are summarized in Table [TJ 



2 



Table 1: Notations 



Symbol 


Description 


II ' II 


L2 norm 


n 


Sample size 


P 


Number of covariates 


K 


Number of Response/Output 


Y 


K dimensional Response/Output 





Set of {1,2,..., if} 




Power set of Q except the empty set 


m 


The highest level of interaction considered in the model, in 




this paper m = K 


Q 


Number of / w in the assumption, in this paper q = 2 K — 1 


Id, K, V 


Element of ^ k used for indexing 


y"(i) 


!/"(*) = n fcew !/*(**) 


yd) 


Augmented responses • • • , y (£)) 


c 


Model parameters 


p 


(p + 1) • q, dimension of c 


T v 


T v = {oj\v C uj} is the subgraph rooted at v containing all its 




descendants 




f T " = (n,L0£T v 




Penalty on f Tv 


Pv 


Weight for penalty on structure T v 




Subgradient of \J(f Tv ) of the vth group 


S"(y;x) 





In graphical models, G = (V,E), the distribution of multivariate discrete random vari- 
ables Yi, . . . , Yk is: 

P (Y l = yi ,...,Y K = VK \x) = — ?— H $ c (yc;X) (i) 

where Z is the normalization factor. The distribution is factorized according to the cliques 
in the graph. A clique C <Z £1 = {1,...,K} is the set of nodes of a fully connected 
subgraph. $c(,yc', X) is the potential function on C. It depends on yc = {y% \ i i G C} and 
the predicative variables X which are shared across all response variables. One example of 
application is to model the relations of multiple clinical responses (hypertension, diabetes, 
etc.) and how they are affected by the person's genetic variables and environmental variables 
(smoking, income, etc.). 

For the purpose of efficient computation, C is usually the set of all maximal cliques 
of the graph. The maximal clique is a clique that is not properly contained in any other 
clique in the graph. Different representations with C as the set of non-maximal cliques 
can b e converted to maximal clique representation by redefinition of the potential func- 
tions ( Wainwright and Jordan . 20081 ). Furthermore, C does not have to reflect the graph 



structure, as long as it is sufficient. For example, the most general choice for any given 
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graphical model is C = {£!}. The conditional independence between the response variables 
is implicitly formulated by the restrictions on the potential functions. See Theorem 12.21 and 
Example 12.11 for details. 

The Markov property states that any two nodes not in a clique are conditionally in- 
dependent given other nodes. For example, Y s ,Y t are conditionally independent given all 
other variables that block the path from Y s toYf. Therefore, C as the set of maximal cliques 



factorizes the graph and specifies the conditional independence in the model. In Figure 1(a) 



we have 2 cliques {1,2,3} and {3,4}. In this case, {l^,!^} are conditionally independent 
given y 3 , so are {Yi,^}. 




(a) 




(b) 





(d) 



Figure 1: Graphical Model Examples, (a) Model 1: The triangle clique {Yi, Y2, Y3} indicate 
a third order interaction. Additionally, there is a pairwise interaction between 
Y3 and Y4. Y4 is conditionally independent with Y\ or Y2 given Y3. (b) Model 
2: Y5,Y6 are another set of interacted random variables which are independent 
of other 4 nodes, (c) Model 3: Y$,Yq, Yj,Y$ form a 4- node clique that connected 
to Yi,Y2 ; Y3 through Y4. (d) Model 4: Y4, • • • ,Y§ form a strongly connected 
component, and Y9, Y\q are independent of other nodes 



Given the graph structure, the potential functions are convenient to characterize the 
distribution on the graph. However, if the graph is unknown in advance, estimating the 
potential functions does not give direct inference of the graph structure, because there are 
different representations with different choices of cliques in the same graph as mentioned 
above. Even if assuming the most general case where C = {f2}, the conditional independence 
between the nodes cannot be represented by &c i n a simple form (see Example l2.ip . which 
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makes the learning of graph structure difficult. The multivariate Bernoulli (MVB) model 
can represent a graphical model whose nodes are Bernoulli random variables, i.e. Y{ = 
or 1. And the parameterization in MVB model is suitable for learning the graph structure. 
We will show later that it is equivalent to GM (JTJ) with binary outcomes. The distribution 
of MVB model is: 

p(Y l = y l ,...,Y K = y k \X) 

= ex V { yi f + ■■■+ y K f K + ■■■+ y 1 y 2 f 1 ' 2 + • • • + y x . . . y K f'-' K - b(f)} 

2 K -l 

= exp{ y"r ~ Kf)} (2) 
w=l 

Here, we use the following notations. Let \I> k be the power set of 0, = {1, . . . , K}. Count- 
ing the empty set, there are 2 K elements in ^>k- Where convenient in what follows, we 
will relabel these elements from to 2 K — 1, e.g. for K = 3, we will use Z 1 ' 2,3 and f 7 
interchangeably without further specification. Because there are 2 K — 1 free parameters 
in (|2|), denote ^ k = ^> K — {0} for simple notation. Let u denotes a set in ^>k-, define 
y = (y 1 , ■ ■ ■ , • • • , y n ) be the augmented response with 



And / = (J 1 , ... , f u f ) be the vect or of natural parameters, where f u (x) is the 

conditional log odds ratio (|Gao et al.l . l200lh to be estimated 

/" = \ogOR(Y h i e u\Yj =0,j$ u;X) (4) 

Here, the odds ratios are calculated recursively 

OR(Y„ i e u u {*» = supP ose * i a (6) 

Later, we will call f l , ■ ■ ■ , f K main effects, and J 1 ' 2 , • • • , f l, "' K measures the interactions 
between the response variables. We are interested in the sparse esti mation of / u , which is in 
a Reproducing Kernel Hilbert Space (RKHS) with kernel K" (jWahbal . fl99ol ) . Denote: 

S"(y;x) = ^2y K r(x); S^x) = £ /"(a;); (7) 

Then the normalization factor is: 

exp(6(/(x))) = 1 + Y, exp(5-(x)) (8) 

And we have the following lemma: 
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Lemma 2.1 In multivariate Bernoulli model, define the odd-even partition of the power 
set of uj as: ^^ dd = {k C uj \ \k\ = \uj\ — k, where k is odd}, and ^e Ven {K C uj \ \n\ = 
\uj\ — k, where k is even}. Note l^-wl = l^e^enl = 2l w l _1 , the natural parameters have the 
following property: 

= U Ke ^ ven P( Y i = and Yj = 0, j G - k\X) 

° g U K ^ odd P( Y i = 1,* G «, and Yj = 0,j EQ — k\X) 1 ' 

and 

_ P( Y i = M e uj, and Yj = 0,j G fi- 

The equivalence between graphical models and MVB model is given in the following theo- 
rem. 

Theorem 2.2 Graphical model of general form (OP with 0/1 nodes is equivalent to multi- 
variate Bernoulli model |0|). ylnd the followings are equivalent: 

1. There is no \C\-order interaction in {Yi,i G C}. 

2. There is no clique C G *$>k i n the graph. 

3. f^ = for all uj such that C C uj. 

A proof is given in Appendix A. Theorem 1 2 . 2 1 states that there is a clique C in the graphical 
model, if there is uj D C, f u ^ in MVB model. The conditional independence specified in 
the graphical model can be fully formulated by MVB model. 

Example 2.1 For a graph with K nodes, the parameters in GM are {§ u \ uj G ^ k}, where 
= &n(Yi = l,i G uj, and Yj = 0, j G 0, — uj) is the potential function. We usually restrict 

$0 = 1 to make the model identifiable. So there are 2 K — 1 free parameters. Similarly, there 

are also 2 K — 1 free parameters in MVB model (f 1 , . . . , f n ) 

When K = 2, £1 = {1,2},C = {£1}, define $xi to be the potential function $q(Yi = 

1,Y 2 = l',X) for simplicity, and define $io, $oi; ^00 similarly. The probability distribution 

with the GM parameterization is 

p(Y 1 = 1,Y 2 = 1\X) = i* u , p(Y 1 = 1,Y 2 = 0\X) = ~$ 10 , 
p{Yi = 0,Y 2 = 1\X) = i*oi, p(Xi = o,y 2 = 0|X) = i$oo 
The relation between GM and MVB model is 

/' = log§S / 2 =log|51, /^ = log|ii^ 

5>00 5>00 3>01 ' *10 

Note, the independence between Y\ and Y 2 implies 

/' =0 or log- — = (11) 

5>01 • 1*10 

Therefore, the sparseness in the conditional log odds ratios in MVB model gives a direct 
inference of the graph structure. But this property does not apply to GM. 
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3. Structure Penalty 

In many applications, the assumption of graphical models is that the graph structure has 
few large cliques. It is equivalent to the sparsity in higher order interactions in MVB model 
by Theorem 12.21 So we will impose a sparse penalty on the dependence structure to shrink 
higher order interactions. 

Let y(i) = (yi(i), ■ ■ ■ ,yx(i)), x(i) = (xi(i), . . . ,x p (i)) be the ith data point. The aug- 
mented representation of the multivariate responses is: 

y{i) = (y x (i),...,y"(i),...,y Q (i)) (12) 

There are I^jcI = 2 K — 1 components in totalE Denote the number of components by q. 
In this paper, we consider the learning of the full model where q = \^k\- Supp ose each 



funct ion is in a Reproducing Kernel Hilbert Space (RKHS) if w with kernel K u (jWahba 
1990h . The general penalized log likelihood model is: 



min/ A (/)=L(/) + AJ(/) (13) 
where the first term is the negative log-likelihood: 



i=i 



and the second term «/(•) is the structure penalty. 

The objective is to obtain sparse estimation of the cliques by structure penalty on /. 
Consider the pairwise links. No link between Y s ,Yt in the graphical model means / w = 



for all ijj D {s,t}. For example, in Figure 1(b), Y\,Y± are conditionally independent means 



/ i f 1,2 ' 4 'i Z 1 ' 3 ' 4 ) Z 1 ' 2,3 ' 4 are all zero. This objective is similar to sparse covariance matrix es 



timation in Gaussian data for neighborhood selection with lasso (jMeinshausen and Buhlmann 



2006) 



However, our model will deal with higher order covariance structures that do not 
exist in Gaussian data. In addition, we not only consider the graph structure of responses 
Y alone, but also the functions of predicative variables X on Y . 

To satisfy this intuition, the penalty is designed to shrink large cliques in the graph. 
Suppose in the true model, there is no interaction on clique C, then all / w should be zero, 
for C C u). The penalty is designed to shrink such f w to ze ro. The idea can be viewed as 



group lasso with overlaps. Group lasso (jYuan and Lin . 2006) leads to selection of variables 



in groups. It has consis tent e stimation when the groups are exclusive and union to the 
whole set. J acob et al. ( 20091 ) considered the penalty on groups with arbitrary overlaps. 



Zhao et al. tod ) set up the general framework for hierarchical variable selections with 



overlapping groups, which we adopt here for the functions. 

We consider the penalty guided by the structure in Figure [H The guiding graph T has 
2 K — 1 nodes: 1, . . . ,oj, . . . ,Q. With some abuse of notation, we use the element in to 
index the node in T. There is an edge from uo\ to 0J2 if an d only if u\ C 0J2 and \ui \ + 1 = \u>2 1 • 
Domain knowledge can be applied here to design a different guiding structure. Jenatton et 
al. (|2009h discussed how to define the groups to achieve different nonzero patterns. 



1. In applications with large graphs, we only consider up to m'th interactions. We truncate higher order 
interactions and get X^fcLi ( uj functions 
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Figure 2: Hierarchical Guiding Structure for Penalty 



Let T v = {uj £ ^k\v C u} be the subgraph rooted at v in T, including all the descen- 
dants of v. Denote f Tv = oj G T v . All the functions are categorized into groups with 
overlaps as G = (Ti, . . . , Tq). The structure penalty on the group T v of functions is: 

where is the weight for the penalty on T v chosen empirically as pr-r. Then the objective 
function is: 

min I x (f) = L(f) + \J2Pv Jj2 ll/il^ (16) 

The following theorem shows that by solving the objective (|16p . / Wl will enter the model 
before f U2 if uj\ C 0J2- That is to say, if f Ul does not exist, there will be no higher order 
interactions on lo<i- The proof is given in Appendix A. 

Theorem 3.1 Objective H6\) is convex, thus the minimal is attainable. Let uji,uj2 € VP Kind 
oj\ C 0J2- If f* is the minimizer, that is € dl\(f*) which is the subgradient of I\ at f*, 
then f*" 2 = almost sure if = 0. 

Example 3.1 If K = 3, f = (Z 1 , / 2 , / 3 , Z 1 ' 2 , Z 1 ' 3 , / 2 ' 3 , Z 1 ' 2 ' 3 ). The grouped functions at 
node 1 in Figure\^is f Tl = (f 1 , f 1 ' 2 , f 1,3 , f 1 ' 2,3 ). The objective is: 

min -i(y,f) + A^VlIZ 1 !! 2 + HZ 1 - 2 !! 2 + HZ 1 - 3 !! 2 + HZ 1 - 2 * 3 !! 2 

+P2v / ||Z 2 ll 2 + IIZ 1 ' 2 ll 2 + IIZ 2 ' 3 ll 2 + IIZ 1 - 2 ' 3 ll 2 

+P3v / IIZ 3 ll 2 + IIZ 1 ' 3 ll 2 + IIZ 2 ' 3 ll 2 + IIZ 1 ' 2 - 3 ll 2 (17) 

+vW\\n 2 + HZ 1 - 2 - 3 !! 2 +P5v / IIZ 1 - 3 ll 2 + IIZ 1 ' 2 ' 3 !! 2 

+WIIZ 2 < 3 II 2 + IIZ 1 ' 2 ' 3 II 2 + PiVW^W) 
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Algorithm 1 Proximal Linearization Algorithm 



Input: co,ao>C > l,tol > 
repeat 

Choose a k € [a m i n ,a max ] 
Solve Eq {19} for d k 

while 8 k = I\{c k ) - I x (c k + d k ) < \\d k \\ 3 do 

/ / Insufficient decrease 

Set a k = max(a min , Qa k ) 

Solve Eq (H]) for d k 
end while 
Set a fc+ i = a k /( 
Set Cfc+i = c k + d k 
until 5 k < tol 



4. Parameter Estimation 

In this paper, we focus on the situation where the wth function space is = {1}@T-Lf. {1} 
refers to the constant function space, and T-tf is a RKHS with a linear kernel. The function 
f u G W has the form: f w (x) = c% + Ej=i c j x j- Its norm is = ll^ll 2 - Here > we 

denote & 1 = (eft, . . . , c£) T G M p+1 as a vector of length p + 1 and c = (c UJ ) u)e ^, K G be the 
concatenated vector of all parameters of length p = (p + 1) • q. Hence, the objective (|16|) is 
now: 



mm 



I x (c) = L(c) + A Y,Pv\\c Tv II (18) 



where c Tu = (c w ) ajg T'„ is a (p + 1) ■ (T^l vector. 

To solve (|18p . we iteratively solve the following proximal linearization problem (jWrighti . 
20ld ): 



minLfc + VL^(c - c fc ) + ^||c - c fc || 2 + AJ(c) (19) 

where L^, = L(c k ), a k is a positive scalar chosen adaptively. With slight abuse of notation, 
we use c k to denote the vector of all parameters at A;th step. Algor ithm Q] summarized the 
framework of solving (]18p . Following the analysis in Wright (|20ld ). we can show that the 



proximal linearization algorithm will converge for negative log-likelihood loss function plus 
group lasso type penalties with overlaps. 

However, solving group lasso with overlaps is not trivial due to the non-smoothness at 
the singu lar point. In recent years, several papers have addressed this problem. Jacob et 
al. (|20Q9h duplicated the variables in the design matrix that appear in group overlaps, then 
solved the problem as group lasso without overlaps. Kim and Xing ( 20101 ) reparameter- 



ized the group norm with additional dummy variables. They alternatively optimized for 
the model parameters and the dummy variables at each iteration. The method performs 
efficiently on quadratic loss function for Gaussian data. But optimizing alternatively over 
two sets of parameters might not scale well on penalized logistic regression. 
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In t his pa per, we solve (|19p by its smooth and convex dual problem proposed by Liu 
and Ye (|2010l ). Let Z = {v e ^>k\ \\c Tv \\ = 0}, and Z = ^ K - Z be the complement. Define 



s v , v G as: 

s v e ^ = {s = (s")^^ | s G RP, ||s|| < Ap„, s u = if w G T„} (20) 
then the subgradient of (fl"9j) is: 

VL + a fe (c-c t ) + 5]s„ + 5]r u (21) 

vez uez 

where s v is the subgradient of Ap^Hc 71 " || for w£Z and r u is the subgradient for u £ Z: 

r u = argmax Su (s u , c), for u £ Z (22) 

The subgradient s v is in a unit ball of certain subspace of W p . These subspaces are not 
perpendicular to each other. Thus, s v 's are not separable, and closed form solution of (fl"9j) 
cannot be obtained. We solve the proximal subproblem (|19h by its smoothing and convex 
dual problem. Note (|19p is equivalent to: 

min maxd>(c, S) = (23) 



VL k (c- c k ) + —\\c- c k \\ + 2_^{s v 



where S is a p x q matrix whose columns are s v . 8 = {S\S = (si, . . . , s V) . . . , sq), s v G 
S„ for v G *S?k} is the feasible region of S. Since <p(-, S) is lower semicontinuous and (j)(c, •) 
is upper semicontinuous, there exists a saddle point and the max and min are exchangeable. 
The solution of minimizing <fi(c, S) is: 



,,S)=c k -—VL k -—Y /Sv (24) 



a k a k 

Substitute c back into ([23]) . we have the dual problem of (fT9j) as: 

maxr/(5) = -^|| y^s„|| 2 + (a k c k -VL k ) T ^2s v (25) 



Following the proof in Liu and Ye (|2010h . we can show that r](S) is convex and Lipschitz 



continuous. The differential is a k ce T where e G MP is a vector of ones. Hence, (I25p can 
be solved by exi sting grad i ent m ethods. We use the accelerated gradient descent method 
implemented in (|Liu et allboOflh . 



5. Experiments 

5.1 Simulation Study 

The simulations are performed to evaluate the learning accuracy of our method. The graphs 
of the response variables are depicted in Figure [TJ In the simulation, we assume the most 
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general distribution family for each graphical model according to its graph structure. For 
example, in Model 1, we have 4 response variables, (Y\, Y2, 13, Y4). There is a triangular 
clique (Y"i, Y2, Y3), but I4 is independent with the other response variables. In this case, we 
have a 15 conditional log odds ratios to estimate. In the true model, the non-zero functions 
are {f 1 , f 2 , f 3 , J 1 ' 2 , f 1 ' 3 , f 2 ' 3 , J 1 - 2 ' 3 , f 4 }. In Model 3, there are 255 functions and 25 of them 
are nonzero. In Model 4, there are 1023 functions and 25 of them are nonzero. 

The predictive variables X = (Xi, . . . , X$) are independently generated from multivari- 
ate Gaussian distribution with mean and variance 1. Each f u has 6 parameters to estimate 
(1 of them is the intercept). These parameters, dj ,j = 1, • • • ,5, are uniformly sampled from 
{—5, —4, • • • , 5}. We set the intercepts Cq in main effects to 1, those in second or higher 
order interactions to 2. Each Y is randomly selected proportional to the probability in 
equation ((2J), where / = Xc. We generate 100 datasets for each graph structure in Figure [Q 
to evaluate the learning accuracy. The sample size in each dataset is 1000. 

The tuning parameter A is chosen by two different tuning methods: 1) GACV (general- 
ized approximation cross validation), 2) BGACV (B-type GACV). The details of the tuning 
methods are discussed in Appendix [Bj 

Table 2: # of True Positive and False Positive Functions 











/ :u .r 






FP 










Model 1 








GACV 


100 


100 


100 


99 89 






123 


BGACV 


83 


89 


86 


68 17 






5 










Model 2 








GACV 


100 


99 


100 


99 83 






321 


BGACV 


92 


93 


92 


83 34 






46 










Model 3 








GACV 


95 


89 


79 


98 57 


77 


31 


394 


BGACV 


37 


18 


21 


92 


36 





143 










Model 4 








GACV 


91 


98 


96 


88 59 


49 




654 


BGACV 


70 


72 


69 


66 







134 



In Table [21 we count, for each function f w , the number of runs out of the 100 repli- 
cations in which f u is recovered (||c^|| 7^ 0). The recovered functions in the true model 
are considered as true positive; while the others not in the true model are false positive. 
Since the main effects are always detected correctly, they are not listed in the table. The 
structure penalty is efficient in recognizing strong interactions in the responses, such as the 
interaction between Y\ , Yi . But its performance on higher order interactions will be affected 
in more complex graph structures, e.g. f 1 ' 2,3 in Model 3 and 4. 

Compared to GACV, BGACV tends to achieve more sparse results in general, because 
they have large penalties on the degrees of freedom of the estimated model. On the contrary, 
GACV will discover more true positive functions with the cost of higher false positive rate. 

In Figure El we show the learning results in terms of true positive rate (TPR) with 
increasing sample size from 100 to 1000. The experimental setting is the same as before. 
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Figure 3: True Positive Rate (TPR) of Graph structure learning with increasing sample size 
from 100 to 1000 by different tuning methods: GACV, BGACV. The TPR curves 
(GACV) have converged for Ml, M2 after 500 samples. But for M3 and M4, larger 
sample size is required to achieve convergence, since there are 1023 functions to 
estimate. BGACV is more conservative in selecting non-zero functions. 



The TPRs are improving along with the increasing sample size. Compared to Model 1 and 
2, the algorithm needs substantially larger sample size to achieve high TPR on Model 3 
and 4. GACV achieves better true positive rate in all four graphical models. This tuning 
method will obtain less sparse models compared to BGACV. 

5.2 Census Bureau County Data 

We use the county data from U.S. Census BureavH to validate our method. It includes 
demog raphic data for all co unties in United States, covering population, employment, 
votes ( Scammon et al. . 20051 ). and etc. We delete the counties which have missing value in 



the columns we are interested, and get 2668 data entries in total. The outcomes for this 



2. http : //www. census . gov/statab/www/ccdb.html 
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Table 3: Selected Response variables 



Response 


Description 


Positive% 


Vote 


2004 votes for Republican presidential candidate 


81.11 


Poverty 


Persons in poverty 


52.70 


VCrime 


Violent Crime, eg. murder, robbery 


23.09 


P Crime 


Property Crime, eg. burglary 


6.82 


URate 


Unemployment Rate 


51.35 


P Change 


Population change in percentage from 2000 to 2006 


64.96 



study are summarized in Table [3l "Vote" is coded as 1 if the Republican candidate won 
in the 2004 presidential election. To dichotomize the rest response variables, the national 
mean is selected as a threshold. The third column in the table gives the percentage of 
positive in the data. The features in the model are: percentage of housing unit change; 
government expenditure; population percentage of 3 ethnic groups (White, African, Asian), 
people foreign born, people over 65, people under 18, people with high school education, 
and people with bachelor degree; birth rate; death rate. 

In the experiment, We first standardize the data to be mean variance 1. Then, we 
can get an estimated graphical model with every fixed A. Adjusting the regularization 
parameter A from 0.3289 to 0.1389, we will discover new interactions entering the model. 
The graph structure of A = 0.1559 (chosen by cross validation) is shown in Figure HI The 
first number on edge indicates the order of the link entering the model, while the second 
one is the corresponding A. The unemployment rate plays an important role as a hub in the 
graph. It is strongly related to poverty and crimes. Population change is negatively related 
to violent crimes, as well as to unemployment rate. 




Figure 4: Correlation of Response in the county data from Census Bureau. The first number 
on the edge shows the order at which the link is recovered. The number in bracket 
denotes the corresponding A. 
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We analyze the link between "Vote" and "PChange", which is recovered by our method 
before any other links. The marginal correlation between them (without conditioning on 
X) is 0.0389, which is the second smallest a bsolute value in the correlation matrix. The 
partial correlation method ( Peng et al. . 2009) is taken as an example to show how the links 
of response variables are discovered without considering X. The link between "Vote" and 
"PChange" is the tenth recovered link using R package space. The reason is that after taking 
features into account, the dependence structure of response variables may change. The 
main contribution in this case is "percentage of housing unit change" (X\) and "population 
percentage of people over 65" {X2). Part of the fitted model is shown below: 



^ vote = 0.0463 • Xt + 0.0877 ■ X 2 + ■ ■ ■ 

jPChange = Q _ 2 S15 ■ X X - 0.0942 ■ X 2 + ■ • ■ 
fVote,PChange = Qmn . ^ _ 0.0115 ■ X 2 + ■ ■ ■ 



The main effects suggest that with increase of housing units, the counties tend to increase 
in population and vote for Republican. With increase of people over 65, the counties tend 
to lose population, but still more likely to vote for Republican. The interaction function 
reveals that as housing units increase, the counties are more likely to have both positive 
results for "Vote" and "PChange" . But this tendency will be counteracted by the increase 
of people over 65: the responses are less likely to take both positive values. 



6. Conclusions and Future Work 

The structure penalty on the multivariate Bernoulli model can efficiently learn the graph 
structure, which indicates the conditional independence of the response variables. In this 
paper, we only consider linear models for the conditional log odds ratios. It can be ex- 
tended to smoothing spline ANOVA model with more freedom by choosing the kernels. 
And Theorem 13.11 holds natually. It is also interesting to see how the penalty will improve 
the prediction power compared to large margin methods. 

Another extension will be the selection of features for each f u . The sparsity in each 
function requires the sparsity within each group. But the graph structure would change with 
different selected features. One application is to discover the relations of multiple symptoms 
or clinical responses and how they are effected by the environmental and genetical covariates. 
Smoking could be significant for many diseases and their interactions, but other covariates, 
such as taking Vitamin might be only related to a subset of the symptoms. As a result, we 
will investigate sparse penalties within each function. 
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Appendix A. Proof 



A.l Proof of Theorem [2H 

Proof Given graphical model ([T]) , let y^ be a realization of yc such that = {yf \ i E C} 
where y^ = 1 if i E uj and yf = otherwise. Let the odd-even partition of the power set of 
uj defined as in Lemma 12. 11 The conditional log odds ratios in MVB model are: 

K/) = iog Z{x) 



Ucec ®c(0;x) 

Conversely, given the MVB model of [21 the cliques can be determined by the nonzero f u : 
clique C exists if C = uj and f u ^ 0. Then the maximal cliques can be inferred from the 
graph structure. And suppose they are C±, . . . , C m . Let be ifi(uji) = Ci, i = 1, . . . ,m, and 
K\ = 0, Ki be tjj(Ki) = Cjn(Cj_iU- • -UCi), i = 2, . . . , m. Then one possible parameterization 
is: 

*Ci (.VG ; x) = exp (S^ (y; x) - S Ri (y; x)) (27) 
Z(x) = exp(6(/)) 

Therefore, graphical model ([TJ with bivariate outcomes is equivalent to the MVB Model 
©• 

In the latter part of the theorem, 1 2 and 3 1 follows naturally from the Markov 
property of graphical models. To show 2 => 3, notice that whenever k(1C = n'nC, y*Q = y^. 
For any possible v = k n C, k' E {k\k = v U u, s.t. u C uj — v} will give k' nC = v. There 
are 2' w ~ t ' such k' in total due to the choice of u. Also, they appear in the nominator and 
denominator of equation (f26|) equally. So, for any C E C, 

[] <Mi/c;*)= II Mi/fr*) (28) 

It follows that f u = by fl26|). ■ 



A.2 Proof of Theorem I3TT1 

Proof We give the proof for the linear case. The convexity of I\ is easy to check, since 
L and J(f Tv ) are all convex in c. Suppose there is some ui2 D uj\ s.t. c*^ 2 7^ 0, by the 
groups constructed through Figure ||c* T "|| ^ for all v Cu)\. So the partial derivative of 



objective (|18j) with respect to c Wl is 



+ A ^^^t^ = ( 29 ) 



Q^jjj 1 / lie* ^ v 



Hence, the probability of {3w2 3 wi s.t c™ 2 7^ 0} equals to the probability that gjsr = 0, 
which is 0. ■ 
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Appendix B. Tuning 

For i-th data, we have: 



S? = S u (x{i)) 

= &(/(*(»)))= log + exp 



(30) 
(31) 



Then, the mean of the augmented response y(i) in the multivariate Bernoulli model is: 



M (i) = E[y(i)\x(i),f] = (//(*'), • • • , m"(0, • • • , m q «) 

where /x (i) = — — = ^— 

a/ re exp bi 

The g x q covariance matrix of the augmented response is: 

W(i) = var(y(i)\x(i)J) 
where the (a, /3)-th element of W(i) is: 



d 2 h _ £ wG T a ni> ex PS?' 



exp 6j 



(32) 
(33) 

(34) 
(35) 



Let i?„ be a p x p diagonal matrix whose (i, i)-th element is 1 if Cj ^ 0. Then, the v-th 
group penalty in (fT8j) can be written as: 



j(f n )=Pv \\n\ 2 =pv\\Rvc\\ 



(36) 



Note R v is symmetric and R v - R v = R v , direct calculation yields the derivative and Hessian 
of the penalty term: 



8J \ - RyC 

oc ^ , \\R v c\\ 



d 2 j 

dcdc T 



R v (\\R v c\\ 2 I -c-c T )R v 
ll^cll 3 



(37) 
(38) 



where J v = (R v (\\R v c\\ 2 I — c ■ c T )R v )/\\R v c\\ 3 . Denote the grand design matrix as: 



D=(D(1) T ■■■ D(n) T f 



where D{i) 



(x(i) T 
x{i) T 

\ 



o \ 


x(i) T J 



(39) 
(40) 
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Suppose there are N non-zero elements of c at location {a±, . . . , a^}. Let D be the matrix 
composed by the ai, . . . , aArth column of D. Then, the Hessian matrix of I\ in (|16|) is: 



d 2 h 



d 2 L 



X- 



d 2 j 



dcdc T dcdc T dcdc T 



D T WD + X VvJv 

V.RyC^O 



Let H be the nq x nq influence matrix that implies 

fx,e -fx -He 



(41) 
(42) 

(43) 



where e is a small perturbation on y; fx = Dc\ is the estimated function value with tuning 
parameter A; and /x ]£ is the estimated function value with the perturbation. Then, the 
analysis of the first order Taylor expansion o f -i^( y_+ e . c\A leads to the formulation of H 
as follows (refer to ( Xiang and Wahba . 19961 ) and [Ma, 2O10l ) Chapter 3 for more details) 



H = D 



d 2 h \ 



-i ~, 



> dcdc T 

The (i,j)-th q x q submatrix of H is 



j D T = b(b T wb + A Yj Pv j v) D q 



-1 ~, 



(44) 



(45) 



Let Q(i) = I — H(i, i)W(i) for i= 1, . . . , n, define the generalized average matrix, denoted 
as Q, of {Q(i),i = 1, . . . , n} as follows 



Q = (5- j)I qxq + 7 • ee 

where e is the unit vector of length q and 

1 



(5 7 

7 5 

\l 7 



7\ 

7 

6J 



(46) 



n Q22i=i tr {Q(V) nq(q-l) L 
Let H be the generalized average of {H(i, = 1, ■ ■ ■ , n}, the GACV score is 

1 n 

GACV(X) = OBS(X) + -^2y(i) T Q- 1 H(y(i) - 



(47) 



where 



OBS(X) = - - y(i) T h(x(i)) + b(f x (x(i))) 
n L 



(48) 



(49) 



is the observed log-likelihood. 
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The degrees of freedom of multivariate Bernoulli data is general ly difficult to obtain. 
But we can have a good approximation from GACV ( Shi et al. . 20081 ) as 



(50) 



i=i 



So the BGACV score can be defined as 



BGACV(X) = OBS(X) + Yl y(i) T Q- l H(y(i) - /i(i)) 



n 2 



(51) 
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